Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes have been made to the dataset, such as removing the
'fnlwgt' feature and records with missing or ill-formatted entries.
import numpy as np # Package for numerical computing with Python
import pandas as pd # Package to work with data in tabular form and the like
from scipy.stats import skew
from time import time # Package to work with time values
from IPython.display import display # Allows the use of display() for DataFrames
import matplotlib.pyplot as plt # Package for plotting
import seaborn as sns # Package for plotting, prettier than matplotlib
import visuals as vs # Adapted from Udacity
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
# iPython Notebook formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Account for changes made to imported packages
%load_ext autoreload
%autoreload 2
data = pd.read_csv("census.csv")
data.info(show_counts=True) # Show information for each factor: non-null counts and data-type of column (show_counts replaces the deprecated null_counts argument in pandas >= 1.2)
data.describe(include='all').T # Summarize each factor, transpose the summary (personal preference)
n_records = data.shape[0] # First element of .shape indicates n
n_greater_50k = data[data['income'] == '>50K'].shape[0] # n of those with income > 50k
n_at_most_50k = data.where(data['income'] == '<=50K').dropna().shape[0] # .where method requires dropping of na for this
greater_percent = round((n_greater_50k / n_records)*100,2) # Show proportion of > 50k to whole data
data_details = {"Number of observations": n_records,
"Number of people with income > 50k": n_greater_50k,
"Number of people with income <= 50k": n_at_most_50k,
"Percent of people with income > 50k": greater_percent} # Cache values of analysis
for item in data_details: # Iterate through the cache
    print("{0}: {1}".format(item, data_details[item])) # Print the values
sns.pairplot(data)
plt.show()
Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.
Factor names containing special characters, like -, can cause issues, so cleaning them may prove helpful.
name_changes = {x: x.replace("-", "_") for x in data.columns.tolist() if "-" in x}
data = data.rename(columns=name_changes)
Working with categorical variables often involves transforming strings to numeric values: frequently 0 or 1 for binomial factors, and a mapping $\{x_0, x_1, \ldots, x_n\} \mapsto \{0, 1, \ldots, n\}$ for multinomial factors.
These values may be ordinal (i.e. values with relationships that can be compared as a ranking, e.g. worst, better, best) or nominal (i.e. values that merely indicate a state, e.g. blue, green, yellow).
# Nominal variables get an arbitrary enumeration; education_level is ordinal, so it receives a hand-ordered mapping below
nom_vars = ['income', 'workclass', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
ord_vars = ['education_level']
map_dict = {}
for name in nom_vars:
    map_dict[name] = {category: number for number, category in enumerate(data[name].unique())}
# map_dict
ed_lev_cat = {' Doctorate': 0,
' Prof-school': 1,
' Masters': 2,
' Bachelors': 3,
' Assoc-voc': 4,
' Assoc-acdm': 5,
' Some-college': 6,
' HS-grad': 7,
' 12th': 8,
' 11th': 9,
' 10th': 10,
' 9th': 11,
' 7th-8th': 12,
' 5th-6th': 13,
' 1st-4th': 14,
' Preschool': 15}
map_dict['education_level'] = ed_lev_cat
for name in map_dict:
    data['numeric_' + name] = data[name].map(map_dict[name])
for name in map_dict.keys():
    if name != 'native_country':
        message = 'Mapping for variable: numeric_{}'.format(name)
        print("=" * len(message))
        print(message)
        map_df = (pd.DataFrame.from_dict(map_dict[name], orient='index')
                  .reset_index()
                  .rename(columns={'index': 'Factor Value', 0: 'Numerical Value'})
                  .sort_values(by=['Numerical Value']))
        display(map_df)
For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data, the training features, or independent variables ($X$).
Y_income = data[['income', 'numeric_income']]
X = data.drop(['income', 'numeric_income'], axis=1)
The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).
To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.
Why does this matter: Extreme points may affect the performance of the predictive model.
Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.
Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables. Linear regression, for example, assumes a zero conditional mean of the residuals, $E\left(u \mid x\right) = 0$ where $u = Y - \hat{Y}$, and homoskedasticity of the residuals given the independent variables; neither concerns the marginal distribution of $X$. In this analysis, the dependent variable is categorical (i.e. discrete or non-continuous), so linear regression is not an appropriate model anyway.
cap_loss = X['capital_loss']
cap_gain = X['capital_gain']
cap_loss_skew, cap_loss_var, cap_loss_mean = skew(cap_loss), np.var(cap_loss), np.mean(cap_loss)
cap_gain_skew, cap_gain_var, cap_gain_mean = skew(cap_gain), np.var(cap_gain), np.mean(cap_gain)
fac_df = pd.DataFrame({'Feature': ['Capital Loss', 'Capital Gain'],
'Skewness': [cap_loss_skew, cap_gain_skew],
'Mean': [cap_loss_mean, cap_gain_mean],
'Variance': [cap_loss_var, cap_gain_var]})
display(fac_df)
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=950,
title_text="Skewed Distributions of Continuous Census Data Features",
showlegend=False
)
fig.add_trace(
go.Histogram(x=X['capital_loss'], nbinsx=25,
name='Capital-Loss'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=X['capital_gain'], nbinsx=25,
name='Capital-Gain'),
row=2, col=1
)
fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=2, col=1)
for i in range(1, 3): # This figure has two subplot rows
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch=dict(
                         tickmode='array',
                         tickvals=[0, 500, 1000, 1500, 2000],
                         ticktext=[0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()
Again, to reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x\right)$, can be applied. This transformation will reduce the amount of variance and pull the mean closer to the center of the distribution.
skewed = ['capital_gain', 'capital_loss']
X_log_transformed = pd.DataFrame(data=X).copy()
X_log_transformed[skewed] = X[skewed].apply(lambda x : np.log(x + 1))
fig = make_subplots(rows=2, cols=1)
fig.update_layout(height=800, width=950,
title_text="Skewed Distributions of Continuous Census Data Features",
showlegend=False
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
name='Log of Capital-Loss'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
name='Log of Capital-Gain'),
row=2, col=1
)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=2, col=1)
for i in range(1, 3):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch=dict(
                         tickmode='array',
                         tickvals=[0, 500, 1000, 1500, 2000],
                         ticktext=[0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()
log_cap_loss_skew = skew(X_log_transformed['capital_loss'])
log_cap_loss_var = round(np.var(X_log_transformed['capital_loss']),5)
log_cap_loss_mean = np.mean(X_log_transformed['capital_loss'])
log_cap_gain_skew = skew(X_log_transformed['capital_gain'])
log_cap_gain_var = round(float(np.var(X_log_transformed['capital_gain'])),5)
log_cap_gain_mean = np.mean(X_log_transformed['capital_gain'])
log_fac_df = pd.DataFrame({'Feature': ['Log Capital Loss', 'Log Capital Gain'],
'Skewness': [log_cap_loss_skew, log_cap_gain_skew],
'Mean': [log_cap_loss_mean, log_cap_gain_mean],
'Variance': [log_cap_loss_var, log_cap_gain_var]})
fac_df = pd.concat([fac_df, log_fac_df], ignore_index=True) # DataFrame.append was removed in pandas 2.0
fac_df['Variance'] = fac_df['Variance'].apply(lambda x: '%.5f' % x)
display(fac_df)
fig = make_subplots(rows=4, cols=1)
fig.update_layout(height=800, width=950,
title_text="Comparison of Distributions of Continuous Census Data Features",
showlegend=False
)
fig.add_trace(
go.Histogram(x=X['capital_loss'], nbinsx=25,
name='Capital-Loss'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
name='Log of Capital-Loss'),
row=2, col=1
)
fig.add_trace(
go.Histogram(x=X['capital_gain'], nbinsx=25,
name='Capital-Gain'),
row=3, col=1
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
name='Log of Capital-Gain'),
row=4, col=1
)
fig.update_xaxes(title_text="Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=2, col=1)
fig.update_xaxes(title_text="Capital-Gain Feature Distribution", row=3, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=4, col=1)
for i in range(1, 5):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch=dict(
                         tickmode='array',
                         tickvals=[0, 500, 1000, 1500, 2000],
                         ticktext=[0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()
Originally, the influence of capital_loss on income was statistically significant, but after the logarithmic transformation, it is not.
Here it can be seen that with a change to the skew, the confidence interval now passes through zero whereas before it did not.
This passing through zero is interpreted as the independent variable having an influence on the dependent variable that is statistically indistinguishable from zero.
train_0 = X['capital_loss']
logit_0 = sm.Logit(Y_income['numeric_income'], train_0)
train_1 = X_log_transformed['capital_loss']
logit_1 = sm.Logit(Y_income['numeric_income'], train_1)
# fit the model
result_0 = logit_0.fit(disp=0)
result_1 = logit_1.fit(disp=0)
# Results
print()
print("Original model")
print(result_0.summary2())
print()
print("Transformed model")
print(result_1.summary2())
| Feature | Skewness | Mean | Variance |
|---|---|---|---|
| Capital Loss | 4.516154 | 88.595418 | 163985.81018 |
| Capital Gain | 11.788611 | 1101.430344 | 56345246.60482 |
| Log Capital Loss | 4.271053 | 0.355489 | 2.54688 |
| Log Capital Gain | 3.082284 | 0.740759 | 6.08362 |
These two terms, normalization and standardization, are frequently used interchangeably, but they serve two different scaling purposes.
Earlier, capital_gain and capital_loss were transformed logarithmically, reducing their skew, and affecting the model's predictive power (i.e. ability to discern the relationship between the dependent and independent variables).
Another method of influencing the model's predictive power is normalization of the numerical independent variables, after which each feature will be treated equally in the model.
However, after scaling is applied, observing the data in its raw form will no longer have the same meaning as before.
Note the output from scaling: age is no longer 39 but instead 0.30137. This value is meaningful only in the context of the rest of the data, not on its own.
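To make the distinction between the two terms concrete, here is a minimal sketch; the `ages` values are illustrative, not taken from the census data:

```python
import numpy as np

ages = np.array([39.0, 50.0, 38.0, 53.0, 28.0])

# Min-max normalization: rescale values into the range [0, 1]
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: shift and scale to zero mean, unit variance (z-scores)
standardized = (ages - ages.mean()) / ages.std()

print(normalized)    # every value falls in [0, 1]
print(standardized)  # mean ~0, standard deviation ~1
```

Normalization bounds the feature's range, which is what MinMaxScaler does below; standardization instead centers the feature and scales by its spread, leaving the range unbounded.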
scaler = MinMaxScaler(feature_range=(0, 1)) # default=(0, 1)
numerical = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
X_log_minmax = pd.DataFrame(data = X_log_transformed).copy()
X_log_minmax[numerical] = scaler.fit_transform(X_log_transformed[numerical])
print("Original Data")
display(X.head(1))
# Show an example of a record with scaling applied
print("=" * 86)
print("Scaled Data")
display(X_log_minmax.head(1))
fig = make_subplots(rows=4, cols=1)
fig.update_layout(height=800, width=950,
title_text="Comparison of Distributions of Continuous Census Data Features",
showlegend=False
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_loss'], nbinsx=25,
name='Log of Capital-Loss'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=X_log_minmax['capital_loss'], nbinsx=25,
name='Normalized Capital-Loss'),
row=2, col=1
)
fig.add_trace(
go.Histogram(x=X_log_transformed['capital_gain'], nbinsx=25,
name='Log of Capital-Gain'),
row=3, col=1
)
fig.add_trace(
go.Histogram(x=X_log_minmax['capital_gain'], nbinsx=25,
name='Normalized Capital-Gain'),
row=4, col=1
)
fig.update_xaxes(title_text="Log of Capital-Loss Feature Distribution", row=1, col=1)
fig.update_xaxes(title_text="Normalized Capital-Loss Feature Distribution", row=2, col=1)
fig.update_xaxes(title_text="Log of Capital-Gain Feature Distribution", row=3, col=1)
fig.update_xaxes(title_text="Normalized Capital-Gain Feature Distribution", row=4, col=1)
for i in range(1, 5):
    fig.update_yaxes(title_text="Number of Records", range=[0, 2000],
                     patch=dict(
                         tickmode='array',
                         tickvals=[0, 500, 1000, 1500, 2000],
                         ticktext=[0, 500, 1000, 1500, ">2000"]),
                     row=i, col=1)
fig.show()
Earlier, I transformed some categorical values into a numeric mapping. Another, perhaps more common, way to do this is to make dummy variables from the values of those factors. Pandas has a simple function, pd.get_dummies(), that can perform this very quickly.
Note that this will create a new variable for every value a categorical variable takes:
| | someFeature | | someFeature_A | someFeature_B | someFeature_C |
|---|---|---|---|---|---|
| 0 | B | | 0 | 1 | 0 |
| 1 | C | ----> one-hot encode ----> | 0 | 0 | 1 |
| 2 | A | | 1 | 0 | 0 |
This means that $p$, the number of factors, will grow, and potentially by a large amount.
It is also worth noting that for modeling, it is important that one value of each factor, a "base case", be dropped from the data. The base case is redundant, i.e. it can be inferred perfectly from the other cases, and, more detrimentally to our model, keeping it leads to multicollinearity of the terms.
In some models (e.g. logistic regression, linear regression), an assumption of no multicollinearity must hold.
factors = ['age', 'workclass', 'education_level', 'education_num', 'marital_status',
'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hours_per_week', 'native_country',]
# Create dummies, dropping the base case
X_trans = pd.get_dummies(X_log_minmax[factors], drop_first=True)
Y = Y_income['numeric_income']
# Print the number of features after one-hot encoding
encoded = list(X_trans.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))
After transforming with one-hot encoding, all categorical variables have been converted into numerical features; earlier, the numerical features were normalized (i.e. scaled between 0 and 1).
Next, for training a machine learning model, it is necessary to split the data into segments. One segment, the training set, will be used for training the model; the other, the testing set, will be used for testing it.
A common method of splitting is to segment based on a proportion of the data. An 80:20 ratio is typical for training:test.
sklearn has a function that works well for this, sklearn.model_selection.train_test_split. Essentially, it randomly assigns a portion of the data to a training set and the remainder to a testing set.
By setting a seed with the random_state option, we can ensure the random split is the same on every run. This is necessary for fairly evaluating the model; otherwise, each run would train and test on the same proportional split (if kept static) but on different observations of the data.
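The split described above can be sketched as follows. The stand-in arrays X_demo and y_demo are illustrative; in this notebook the arguments would be the transformed features X_trans and the label Y, and the 80:20 ratio and random_state=0 are example choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the notebook's X_trans and Y: 100 observations, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# 80:20 train:test split; random_state fixes the seed so the same
# observations land in each segment on every run
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

print(len(X_train), len(X_test))  # 80 20
```

Re-running the cell with the same random_state reproduces the exact same partition, which keeps model evaluations comparable across runs.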
# # Full Page - Code
!jupyter nbconvert WIP_Donor_Classification.ipynb --output WIP_Class_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none
# # Full Page - No Code
!jupyter nbconvert WIP_Donor_Classification.ipynb --output WIP_Class_No_Code --reveal-prefix=reveal.js --SlidesExporter.reveal_theme=serif --SlidesExporter.reveal_scroll=True --SlidesExporter.reveal_transition=none --TemplateExporter.exclude_input=True
# # Slides - No Code
!jupyter nbconvert --to slides WIP_Donor_Classification.ipynb --output WIP_Class_Slides --TemplateExporter.exclude_input=True --SlidesExporter.reveal_transition=none --SlidesExporter.reveal_scroll=True